AITopics | speech production

Limit cycles for speech

Gafos, Adamantios I., Kuberski, Stephan R.

arXiv.org Artificial Intelligence, Dec-5-2025

Rhythmic fluctuations in acoustic energy and accompanying neuronal excitations in cortical oscillations are characteristic of human speech, yet whether a corresponding rhythmicity inheres in the articulatory movements that generate speech remains unclear. The received understanding of speech movements as discrete, goal-oriented actions struggles to make contact with the rhythmicity findings. In this work, we demonstrate that an unintuitive -- but no less principled than the conventional -- representation for discrete movements reveals a pervasive limit cycle organization and unlocks the recovery of previously inaccessible rhythmic structure underlying the motor activity of speech. These results help resolve a time-honored tension between the ubiquity of biological rhythmicity and discreteness in speech, the quintessential human higher function, by revealing a rhythmic organization at the most fundamental level of individual articulatory actions.

artificial intelligence, oscillator, speech, (15 more...)

arXiv.org Artificial Intelligence

2512.04642

Country: Europe > Germany (0.29)

Genre: Research Report (0.50)

Industry: Health & Medicine > Therapeutic Area (0.47)

Technology: Information Technology > Artificial Intelligence (1.00)

The Dynamic Articulatory Model DYNARTmo: Dynamic Movement Generation and Speech Gestures

Kröger, Bernd J.

arXiv.org Artificial Intelligence, Nov-12-2025

The neural generation and control of speech utterances is a complex process that is still not fully understood. However, several neurobiologically inspired models have been proposed that describe the hierarchical control concept of utterance generation (e.g., Hickok and Poeppel (2012); Bohland et al. (2010); Kröger et al. (2020); Parrell et al. (2018)). This process begins with the neural activation of the cognitive-linguistic representation of an utterance, followed by a higher-level premotor representation, leading to neuromuscular activation patterns, and finally to the articulatory-acoustic realization of the utterance (cf.

artificial intelligence, closing, gesture score, (18 more...)

arXiv.org Artificial Intelligence

2511.08372

Country: Europe > Germany (0.28)

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.67)

Technology: Information Technology > Artificial Intelligence > Cognitive Science (0.95)

Speech Emotion Recognition with Phonation Excitation Information and Articulatory Kinematics

Zhang, Ziqian, Huang, Min, Xiao, Zhongzhe

arXiv.org Artificial Intelligence, Nov-12-2025

Speech emotion recognition (SER) has advanced significantly thanks to deep-learning methods, while textual information further enhances its performance. However, few studies have focused on the physiological information available during speech production, which also encompasses speaker traits, including emotional states. To bridge this gap, we conducted a series of experiments to investigate the potential of phonation excitation information and articulatory kinematics for SER. Due to the scarcity of training data for this purpose, we introduce a portrayed emotional dataset, STEM-E2VA, which includes audio and physiological data such as electroglottography (EGG) and electromagnetic articulography (EMA). EGG and EMA provide information on phonation excitation and articulatory kinematics, respectively. Additionally, we performed emotion recognition using estimated physiological data derived from speech through inversion methods, instead of collected EGG and EMA, to explore the feasibility of applying such physiological information in real-world SER. Experimental results confirm the effectiveness of incorporating physiological information about speech production for SER and demonstrate its potential for practical use in real-world scenarios.

machine learning, natural language, recognition, (18 more...)

arXiv.org Artificial Intelligence

2511.07955

Genre: Research Report > New Finding (0.34)

Industry: Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Emotion (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns

Kröger, Bernd J.

arXiv.org Artificial Intelligence, Nov-7-2025

We present DYNARTmo, a dynamic articulatory model designed to visualize speech articulation processes in a two-dimensional midsagittal plane. The model builds upon the UK-DYNAMO framework and integrates principles of articulatory underspecification, segmental and gestural control, and coarticulation. DYNARTmo simulates six key articulators based on ten continuous and six discrete control parameters, allowing for the generation of both vocalic and consonantal articulatory configurations. The current implementation is embedded in a web-based application (SpeechArticulationTrainer) that includes sagittal, glottal, and palatal views, making it suitable for use in phonetics education and speech therapy. While this paper focuses on the static modeling aspects, future work will address dynamic movement generation and integration with articulatory-acoustic modules.

artificial intelligence, dynartmo, kröger, (16 more...)

arXiv.org Artificial Intelligence

2507.20343

Country:

Europe > Germany (0.28)
North America > United States (0.28)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Speech (0.70)

MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction

Al-Radhi, Mohammed Salah, Németh, Géza, Gerazov, Branislav

arXiv.org Artificial Intelligence, Aug-6-2025

Speech synthesis from intracranial EEG (iEEG) signals offers a promising avenue for restoring communication in individuals with severe speech impairments. However, achieving intelligible and natural speech remains challenging due to limitations in feature representation, prosody modeling, and phase reconstruction. We introduce MiSTR, a deep-learning framework that integrates: 1) Wavelet-based feature extraction to capture fine-grained temporal, spectral, and neurophysiological representations of iEEG signals, 2) A Transformer-based decoder for prosody-aware spectrogram prediction, and 3) A neural phase vocoder enforcing harmonic consistency via adaptive spectral correction. Evaluated on a public iEEG dataset, MiSTR achieves state-of-the-art speech intelligibility, with a mean Pearson correlation of 0.91 between reconstructed and original Mel spectrograms, improving over existing neural speech synthesis baselines.
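The reported metric, a mean Pearson correlation between reconstructed and original Mel spectrograms, can be illustrated with a minimal sketch. The abstract does not say how the mean is taken, so averaging the per-band correlation over Mel bands is an assumption here, and the function names and toy numbers are hypothetical rather than taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def mean_band_correlation(original, reconstructed):
    """Average the per-band Pearson correlation over all Mel bands.

    Assumption: each argument is a list of bands, and each band is a list
    of per-frame values. The paper may aggregate differently.
    """
    if len(original) != len(reconstructed) or not original:
        raise ValueError("spectrograms must have the same, non-zero band count")
    scores = [pearson(o, r) for o, r in zip(original, reconstructed)]
    return sum(scores) / len(scores)


# Toy spectrograms: 2 Mel bands x 4 frames (made-up numbers).
original = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
reconstructed = [[2.0, 4.0, 6.0, 8.0], [4.0, 3.0, 2.0, 1.0]]
score = mean_band_correlation(original, reconstructed)
```

Because Pearson correlation ignores scale and offset, the first toy band still correlates perfectly even though the reconstruction is twice as loud; a high score therefore indicates matching spectral shape over time, not matching absolute energy.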

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2508.03166

Country: Europe > Hungary (0.15)

Genre: Research Report (0.64)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Towards disentangling the contributions of articulation and acoustics in multimodal phoneme recognition

Foley, Sean, Nguyen, Hong, Lee, Jihwan, Kadiri, Sudarsana Reddy, Byrd, Dani, Goldstein, Louis, Narayanan, Shrikanth

arXiv.org Artificial Intelligence, Jun-2-2025

Although many previous studies have carried out multimodal learning with real-time MRI data that captures the audio-visual kinematics of the vocal tract during speech, these studies have been limited by their reliance on multi-speaker corpora. This prevents such models from learning a detailed relationship between acoustics and articulation due to considerable cross-speaker variability. In this study, we develop unimodal audio and video models as well as multimodal models for phoneme recognition using a long-form single-speaker MRI corpus, with the goal of disentangling and interpreting the contributions of each modality. Audio and multimodal models show similar performance on different phonetic manner classes but diverge on places of articulation. Interpretation of the models' latent space shows similar encoding of the phonetic space across audio and multimodal models, while the models' attention weights highlight differences in acoustic and articulatory timing for certain phonemes.

artificial intelligence, machine learning, recognition, (16 more...)

arXiv.org Artificial Intelligence

2505.24059

Country: North America > United States > California (0.29)

Genre: Research Report > New Finding (0.67)

Industry: Health & Medicine (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.69)

Articulatory Feature Prediction from Surface EMG during Speech Production

Lee, Jihwan, Huang, Kevin, Avramidis, Kleanthis, Pistrosch, Simon, Gonzalez-Machorro, Monica, Lee, Yoonjeong, Schuller, Björn, Goldstein, Louis, Narayanan, Shrikanth

arXiv.org Artificial Intelligence, May-30-2025

We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that these predicted articulatory features can be decoded into intelligible speech waveforms. To our knowledge, this is the first method to decode speech waveforms from surface EMG via articulatory features, offering a novel approach to EMG-based speech synthesis. Additionally, we analyze the relationship between EMG electrode placement and articulatory feature predictability, providing knowledge-driven insights for optimizing EMG electrode configurations. The source code and decoded speech samples are publicly available.

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2505.13814

Country: North America > United States > California (0.29)

Genre: Research Report > New Finding (0.68)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.35)
Health & Medicine > Diagnostic Medicine (0.35)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Decoding Phone Pairs from MEG Signals Across Speech Modalities

de Zuazo, Xabier, Navas, Eva, Saratxaga, Ibon, Bourguignon, Mathieu, Molinaro, Nicola

arXiv.org Artificial Intelligence, May-22-2025

Understanding the neural mechanisms underlying speech production is essential for both advancing cognitive neuroscience theory and developing practical communication technologies. In this study, we investigated magnetoencephalography signals to decode phones from brain activity during speech production and perception (passive listening and voice playback) tasks. Using a dataset comprising 17 participants, we performed pairwise phone classification, extending our analysis to 15 phonetic pairs. Multiple machine learning approaches, including regularized linear models and neural network architectures, were compared to determine their effectiveness in decoding phonetic information. Our results demonstrate significantly higher decoding accuracy during speech production (76.6%) compared to passive listening and playback modalities (~51%), emphasizing the richer neural information available during overt speech. Among the models, the Elastic Net classifier consistently outperformed more complex neural networks, highlighting the effectiveness of traditional regularization techniques when applied to limited and high-dimensional MEG datasets. In addition, analysis of specific brain frequency bands revealed that low-frequency oscillations, particularly Delta (0.2-3 Hz) and Theta (4-7 Hz), contributed most substantially to decoding accuracy, suggesting that these bands encode critical speech production-related neural processes. Despite the use of advanced denoising methods, it remains unclear whether decoding solely reflects neural activity or whether residual muscular or movement artifacts also contributed, indicating the need for further methodological refinement. Overall, our findings underline the critical importance of examining overt speech production paradigms, which, despite their complexity, offer opportunities to improve brain-computer interfaces to help individuals with severe speech impairments.

accuracy, artificial intelligence, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2505.15355

Country:

Europe (1.00)
North America > United States > Minnesota (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area > Neurology (1.00)
Health & Medicine > Health Care Technology (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture

Pillai, Leena G, Mubarak, D. Muhammad Noorul, Sherly, Elizabeth

arXiv.org Artificial Intelligence, Apr-28-2025

Speech production is a complex sequential process that involves the coordination of various articulatory features. Among them, the tongue is a highly versatile active articulator responsible for shaping airflow to produce targeted speech sounds that are intelligible, clear, and distinct. This paper presents a novel approach for predicting the tongue and lip articulatory features involved in given speech acoustics using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture, combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing with fixed-weights initialization. The proposed network is trained on two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data, each introducing variations in terms of geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), Corpus Dependent (CD), and Cross Corpus (CC) modes. Experimental results indicate that the proposed model with the fixed-weights approach outperformed adaptive-weights initialization within a relatively small number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.

articulatory feature, artificial intelligence, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2504.18099

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

The order in speech disorder: a scoping review of state of the art machine learning methods for clinical speech classification

Moell, Birger, Aronsson, Fredrik Sand, Östberg, Per, Beskow, Jonas

arXiv.org Artificial Intelligence, Mar-3-2025

Background: Speech patterns have emerged as potential diagnostic markers for conditions with varying etiologies. Machine learning (ML) presents an opportunity to harness these patterns for accurate disease diagnosis. Objective: This review synthesized findings from studies exploring ML's capability in leveraging speech for the diagnosis of neurological, laryngeal and mental disorders. Methods: A systematic examination of 564 articles was conducted, with 91 articles included in the study, which encompassed a wide spectrum of conditions, ranging from voice pathologies to mental and neurological disorders. Methods for speech classification were assessed based on the relevant studies and scored from 0 to 10 based on the reported diagnostic accuracy of their ML models. Results: High diagnostic accuracies were consistently observed for laryngeal disorders, dysarthria, and changes related to speech in Parkinson's disease. These findings indicate the robust potential of speech as a diagnostic tool. Disorders like depression, schizophrenia, mild cognitive impairment and Alzheimer's dementia also demonstrated high accuracies, albeit with some variability across studies. Meanwhile, disorders like OCD and autism highlighted the need for more extensive research to ascertain the relationship between speech patterns and the respective conditions. Conclusion: ML models utilizing speech patterns demonstrate promising potential in diagnosing a range of mental, laryngeal, and neurological disorders. However, the efficacy varies across conditions, and further research is needed. The integration of these models into clinical practice could potentially revolutionize the evaluation and diagnosis of a number of different medical conditions.

accuracy, disorder, speech, (17 more...)

arXiv.org Artificial Intelligence

2503.04802

Country:

Europe > Germany > Saarland > Saarbrücken (0.04)
North America > United States > New York > New York County > New York City (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
(3 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)
Research Report > Experimental Study (0.67)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
Health & Medicine > Therapeutic Area > Neurology > Parkinson's Disease (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (0.68)
